Andrada Stefana Astefanoaie, University of Konstanz, Andrada.Astefanoaie@uni-konstanz.de
Rodica Bozianu, University of Konstanz, Rodica.Bozianu@uni-konstanz.de
Roland Jungnickel, University of Konstanz, Roland.Jungnickel@uni-konstanz.de
Dr. Peter Bak, University of Konstanz, Peter.Bak@uni-konstanz.de
For the analysis of the data we used the data mining tool KNIME. KNIME is a
modular data exploration platform that enables the user to visually create data
flows (often referred to as pipelines), selectively execute some or all
analysis steps, and later investigate the results through interactive views on
data and models. We also used R and some self-written Java programs. For the
visualization of the extracted data we used IBM's Many Eyes and the Protovis
toolkit. Protovis composes custom views of data with simple marks such as bars
and dots. It uses JavaScript and SVG for web-native visualizations.
Video:
http://ava.dbvis.de/MC2/AVA-final-video.mp4
ANSWERS:
MC2.1:
Analyze the records you have been given to characterize the spread of the
disease. You should take into consideration symptoms of the disease,
mortality rates, temporal patterns of the onset, peak and recovery of the
disease. Health officials hope that whatever tools are developed to
analyze this data might be available for the next epidemic outbreak. They
are looking for visualization tools that will save them analysis time so they
can react quickly.
In Mini Challenge 2 the task was to identify and characterize the spread
of an epidemic outbreak. At some steps we used visualizations in order to
extract the required information from the provided data.
For the analysis of the data we used the Konstanz Information Miner
(KNIME) [1] data mining tool, R [2] and some self-written
Java programs. For the visualization of the extracted data we used IBM's Many
Eyes [4] and the Protovis [3] toolkit.
To find the answers for both questions of the Mini Challenge we designed an
analytic pipeline that combines automatic and semi-automatic data analysis with
interactive visual exploration.
Figure 1. The analytic pipeline represents the workflow that was used to extract
the information needed to answer the Mini Challenge.
The analytic pipeline is divided into three parts:
1. preparation of data
2. information extraction
3. result
2.1.1. Initial Analysis
The first part of the analytic pipeline, common to both questions of Mini
Challenge 2, comprises the initial analysis and the data preprocessing. The raw
input data consists of CSV files containing data for 11 locations. For each
location we were supplied with two CSV files, one with hospitalized patients
and one with the dates of death for patients who died.
From the initial analysis we concluded that the average mortality rate is
2.45%. Another observation we made was the equal distribution of affected males
and females. From this we concluded that gender has no influence on who is
affected.
2.1.2. Data Preprocessing
The first preprocessing activity was to merge the two patient tables. The
second was to clean the values in the symptoms column. This was done in
multiple steps. First, the symptoms in records with more than one symptom were
separated by commas. Second, we replaced abbreviations and duplicate symptoms
in order to have only one term for each specific symptom.
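The two preprocessing activities can be sketched roughly as follows. This is only an illustrative Python sketch, not the KNIME workflow or Java code we actually used; the field names (`id`, `date`, `symptoms`, `date_of_death`) and the entries of the `SYNONYMS` table are assumptions made for the example.

```python
# Hypothetical synonym table; the real mapping was built by hand.
SYNONYMS = {
    "abd pain": "abdominal pain",
    "stomach pain": "abdominal pain",
    "nosebleed": "nose bleed",
}


def normalize_symptoms(raw):
    """Split a free-text symptom field on commas and map each term to one canonical name."""
    terms = []
    for part in raw.split(","):
        term = part.strip().lower()
        if not term:
            continue
        term = SYNONYMS.get(term, term)
        if term not in terms:  # drop duplicates within one record
            terms.append(term)
    return terms


def merge_records(hospitalizations, deaths):
    """Attach the date of death (or None) to each hospitalization record, matched by patient id."""
    death_by_id = {d["id"]: d["date_of_death"] for d in deaths}
    merged = []
    for rec in hospitalizations:
        merged.append({
            "id": rec["id"],
            "date": rec["date"],
            "symptoms": normalize_symptoms(rec["symptoms"]),
            "date_of_death": death_by_id.get(rec["id"]),
        })
    return merged
```

Keeping the synonym table as plain data means the manually collected replacements can be extended without touching the merging logic.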
The initial analysis and data preprocessing took about
20 hours (8 hours to manually replace abbreviations, 12 hours to
programmatically generate necessary data).
The Information Extraction phase is subdivided into two parts, data reduction
and visualization. It was done with the aforementioned tools and took about
10 hours.
To extract only the symptoms that characterize the disease, we counted the
number of occurrences of each symptom on each day for both hospitalizations and
deaths. Afterwards, the mortality rate for each symptom was calculated. Using
this data and the treemap presented in Figure 2, we identified the symptoms
that characterize the disease.
Figure 2. Treemap representing the symptoms with the highest occurrence and a
mortality rate > 1% for each location. Location is encoded by color hue,
occurrence of a symptom by area and mortality rate by color saturation.
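The data reduction behind the treemap can be approximated with the sketch below. It is a simplified illustration rather than our actual workflow: it aggregates over all days instead of counting per day, the record layout (`symptoms` list, `died` flag) is assumed, and the `top` cut-off is a hypothetical parameter. Only the 1% mortality threshold is taken from the treemap.

```python
from collections import Counter


def symptom_mortality(records):
    """Return {symptom: (occurrences, deaths, mortality_rate_percent)}."""
    occurrences = Counter()
    deaths = Counter()
    for rec in records:
        for symptom in set(rec["symptoms"]):  # count each symptom once per record
            occurrences[symptom] += 1
            if rec["died"]:
                deaths[symptom] += 1
    return {s: (n, deaths[s], 100.0 * deaths[s] / n) for s, n in occurrences.items()}


def characteristic_symptoms(stats, min_rate=1.0, top=10):
    """Keep symptoms whose mortality rate exceeds min_rate, most frequent first."""
    kept = [(s, v) for s, v in stats.items() if v[2] > min_rate]
    kept.sort(key=lambda item: item[1][0], reverse=True)
    return [s for s, _ in kept[:top]]
```

A per-day variant would use the pair (day, symptom) as the counter key instead of the symptom alone.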
In a confirmatory step, the correlation of the symptoms was calculated and
visualized with the Arc Graph presented in Figure 3. The visualization shows
that some symptoms are not correlated with the main symptoms. Because of their
high occurrence, however, we cannot conclude that these symptoms are unrelated
to the disease.
Figure 3. Arc Graph - shows the connection between the most common symptoms. The size of each node shows the number of occurrences of the symptom and the weight of the arc represents the number of records with both symptoms.
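The node sizes and arc weights described in the caption can be derived by counting co-occurrences, as in the following sketch. It assumes that each record has already been reduced to a list of cleaned symptom names; it shows only the counting, not the Protovis rendering, and it is not a correlation measure in the statistical sense.

```python
from collections import Counter
from itertools import combinations


def cooccurrence(records):
    """Count symptom occurrences (node sizes) and symptom pairs per record (arc weights)."""
    node_size = Counter()
    arc_weight = Counter()
    for symptoms in records:
        unique = sorted(set(symptoms))  # sorted so each pair has one canonical order
        node_size.update(unique)
        arc_weight.update(combinations(unique, 2))
    return node_size, arc_weight
```

Sorting the symptoms first ensures that a pair such as (fever, vomiting) is counted under a single key regardless of the order in which the symptoms were recorded.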
To gain more insight into the temporal patterns of the disease, a Stack Graph
was used. To keep the visualization readily comprehensible, we focused on the
symptoms discovered with the help of the treemap.
Figure 4. Stack Graph - focuses
only on the main symptoms. The number of occurrences is represented by vertical
thickness. The x-axis represents time.
The Stack Graph reveals that the most affected locations are Aleppo, Karachi
and Nairobi. In the second question of the Mini Challenge, we confirm this
finding using other visualizations.
With this visualization it is possible to see different aspects of the data,
such as the total number of people who died and the number of people who
recovered for a given symptom in each location.
Results were extracted by analyzing the visualizations that were created. Each
visualization yielded intermediate results that determined the next steps of
the analytic process. Extracting the results took about 7 hours in total.
By interpreting the treemap, the characterizing symptoms were identified. They
are divided into main symptoms (abdominal pain, nose bleed, vomiting, vomiting
blood, diarrhea) and accompanying symptoms (back pain, fever, neck pain).
The onset of the outbreak occurred between April 24th and April 30th. The
occurrence of the main symptoms increased suddenly at the beginning (from
April 28th), decreased briefly around May 6th, and then increased again until
reaching the peak.
The peaks of the disease in the different locations are reached on the
following dates (without Turkey and Thailand, which we show in the second
question of the Mini Challenge to be unaffected):
1. Kenya (Nairobi) – 16th May
2. Syria (Aleppo) – 17th May
3. Lebanon, Yemen and Pakistan (Karachi) – 19th May
4. Venezuela, Saudi Arabia and Iran – 20th May
5. Colombia – 21st May
Recovery begins after each peak, and full recovery is reached around June 11th
for all locations.
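We read the peak dates off the visualizations. As an automated cross-check, the day with the highest number of cases per location could be found with a sketch like the one below. The input layout (a mapping from ISO-formatted date strings to daily case counts) and the function names are assumptions made for illustration.

```python
def find_peak(daily_counts):
    """Return (date, count) for the day with the most cases; ties go to the earliest date."""
    # ISO date strings sort chronologically, and max() keeps the first maximum it sees.
    return max(sorted(daily_counts.items()), key=lambda item: item[1])


def peak_order(counts_by_location):
    """Return [(location, (peak_date, peak_count)), ...] ordered by peak date."""
    peaks = {loc: find_peak(counts) for loc, counts in counts_by_location.items()}
    return sorted(peaks.items(), key=lambda item: item[1][0])
```

A plain maximum is sensitive to day-to-day noise, so for strongly fluctuating locations the result should still be checked against the visualization.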
The average mortality rate of the disease is 6.762296%, which is higher than
the initial average mortality rate, which took all hospitalized people into
consideration. Without taking Thailand and Turkey into consideration it
increases to 7.36427%.
By using existing tools it was possible to extract the information needed to
answer the questions of the Mini Challenge. We discovered the main symptoms of
the disease (abdominal pain, nose bleed, vomiting, vomiting blood, diarrhea)
and an overall mortality rate of about 6.76%. The onset of the disease occurred
between April 24th and April 30th, the peak was reached between May 16th and
May 21st, and full recovery was reached around June 11th.
4. References
[1] KNIME (Konstanz Information Miner) - http://www.knime.org/
[2] R - http://www.r-project.org/
[3] Protovis - http://vis.stanford.edu/protovis/
[4] Many Eyes - http://manyeyes.alphaworks.ibm.com/manyeyes/
MC2.2:
Compare the outbreak across cities. Factors to consider include timing of
outbreaks, numbers of people infected and recovery ability of the individual
cities. Identify any anomalies you found.
In the first question of the Mini Challenge the characterizing symptoms of the
disease were identified with the help of the treemap in Figure 2.
Figure 2. Treemap representing the symptoms with the highest occurrence and a
mortality rate > 1% for each location. Location is encoded by color hue,
occurrence of a symptom by area and mortality rate by color saturation.
Once the symptoms were identified, the outbreaks and the peaks were established
using the Small Multiples Chart in Figure 3.
Figure 3. Small Multiples Chart - for each location, the first column shows the
daily evolution of the number of people who died (taking into consideration the
day of hospitalization and the day of death). The second column shows the
total number of patients infected. The third column shows the evolution of the
percentage of people who died.
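The three series per location shown in the chart could be prepared along the lines of the following sketch. It rests on an assumption about how deaths are assigned to days: each death is counted on the patient's day of hospitalization. The field names (`location`, `date`, `date_of_death`) are likewise hypothetical.

```python
from collections import defaultdict


def daily_series(records):
    """Return {location: [(day, deaths, patients, percent_died), ...]} sorted by day."""
    totals = defaultdict(lambda: [0, 0])  # (location, day) -> [deaths, patients]
    for rec in records:
        cell = totals[(rec["location"], rec["date"])]
        cell[1] += 1
        if rec["date_of_death"] is not None:
            cell[0] += 1  # assumption: death is attributed to the hospitalization day
    series = defaultdict(list)
    for (location, day), (deaths, patients) in sorted(totals.items()):
        series[location].append((day, deaths, patients, 100.0 * deaths / patients))
    return dict(series)
```

Each tuple supplies one data point for the three columns of a location's row: deaths, patients, and percentage of deaths.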
By visualizing the extracted information, we established that the outbreak of
the disease began at almost the same time in each country, namely between
April 24th and April 30th. The first two locations to experience the outbreak
were Kenya (Nairobi) and Lebanon. The order of the outbreaks is not clearly
discernible because most cities show large day-to-day differences in the number
of patients at the beginning of the outbreak. In almost all cities the number
of patients increases slightly, then decreases, and then increases again. The
order of the peaks can be easily observed:
1. Kenya (Nairobi) – 16th May
2. Syria (Aleppo) – 17th May
3. Lebanon, Yemen and Pakistan (Karachi) – 19th May
4. Venezuela, Saudi Arabia and Iran – 20th May
5. Colombia – 21st May
These peaks were confirmed in the first question of the
Mini Challenge.
Figure 3 reveals that Thailand and Turkey do not have a clear peak, and that
the numbers of patients and deaths fluctuate very strongly. From this
observation we conclude that these countries were not affected by the disease.
The World Map in Figure 4 was chosen to see whether there is any correlation
between the location of a country and the measured values, and to compare the
spread across cities.
The most affected country is Syria, represented by Aleppo. Kenya (represented
by Nairobi) follows with a very small difference.
Figure 4. World Map - shows the average mortality rate per country, the
recovery ability per country and the percentage of people infected per country.
Across all countries, on average about 30% of the hospitalized people were
infected with the disease. The mortality rate in the given locations is about
10%.
These results were extracted in about 6 hours by interpreting the
visualizations that were created.
By using existing tools it was possible to extract the information needed to
answer the questions of the Mini Challenge. We identified Nairobi as the first
location to be infected and to reach the peak. Aleppo is the most affected
location, and the locations generally have a recovery ability of about 90%.
Turkey and Thailand can be considered anomalies because they were not affected.